{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
    "<div class=\"align-center\">\n",
    "<a href=\"https://unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
    "<a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n",
    "<a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a></a> Join Discord if you need help + \u2b50 <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> \u2b50\n",
    "</div>\n",
    "\n",
    "To install Unsloth your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).\n",
    "\n",
    "You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### News"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Introducing FP8 precision training for faster RL inference. [Read Blog](https://docs.unsloth.ai/new/fp8-reinforcement-learning).\n",
    "\n",
    "Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).\n",
    "\n",
    "[gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!\n",
    "\n",
    "Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.\n",
    "\n",
    "Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "%%capture\nimport os, re\nif \"COLAB_\" not in \"\".join(os.environ.keys()):\n    !pip install unsloth\nelse:\n    # Do this only in Colab notebooks! Otherwise use pip install unsloth\n    import torch; v = re.match(r\"[0-9]{1,}\\.[0-9]{1,}\", str(torch.__version__)).group(0)\n    xformers = \"xformers==\" + (\"0.0.33.post1\" if v==\"2.9\" else \"0.0.32.post2\" if v==\"2.8\" else \"0.0.29.post3\")\n    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo\n    !pip install sentencepiece protobuf \"datasets==4.3.0\" \"huggingface_hub>=0.34.0\" hf_transfer\n    !pip install --no-deps unsloth\n!pip install transformers==4.56.2\n!pip install --no-deps trl==0.22.2"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unsloth"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "QmUBVEnvCDJv",
    "outputId": "3824f024-13a6-40d6-f30f-03e9161d0733"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==((====))==  Unsloth: Fast Gemma patching release 2024.4\n",
      "   \\\\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.\n",
      "O^O/ \\_/ \\    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.\n",
      "\\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.\n",
      " \"-____-\"     Free Apache license: http://github.com/unslothai/unsloth\n"
     ]
    }
   ],
   "source": [
    "from unsloth import FastLanguageModel\n",
    "import torch\n",
    "\n",
    "max_seq_length = 4096  # Gemma sadly only supports max 8192 for now\n",
    "dtype = (\n",
    "    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+\n",
    ")\n",
    "load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.\n",
    "\n",
    "# 4bit pre quantized models we support for 4x faster downloading + no OOMs.\n",
    "fourbit_models = [\n",
    "    \"unsloth/mistral-7b-bnb-4bit\",\n",
    "    \"unsloth/mistral-7b-instruct-v0.2-bnb-4bit\",\n",
    "    \"unsloth/llama-2-7b-bnb-4bit\",\n",
    "    \"unsloth/llama-2-13b-bnb-4bit\",\n",
    "    \"unsloth/codellama-34b-bnb-4bit\",\n",
    "    \"unsloth/tinyllama-bnb-4bit\",\n",
    "    \"unsloth/gemma-7b-bnb-4bit\",  # New Google 6 trillion tokens model 2.5x faster!\n",
    "    \"unsloth/gemma-2b-bnb-4bit\",\n",
    "]  # More models at https://huggingface.co/unsloth\n",
    "\n",
    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
    "    model_name = \"unsloth/codegemma-7b-bnb-4bit\",  # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B\n",
    "    max_seq_length = max_seq_length,\n",
    "    dtype = dtype,\n",
    "    load_in_4bit = load_in_4bit,\n",
    "    # token = \"hf_...\", # use one if using gated models like meta-llama/Llama-2-7b-hf\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SXd9bTZd1aaL"
   },
   "source": [
    "We now add LoRA adapters so we only need to update 1 to 10% of all parameters!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "6bZsfBuZDeCL",
    "outputId": "0f03fa94-1657-4ee9-a505-47b02a57e3c1"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Unsloth 2024.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.\n"
     ]
    }
   ],
   "source": [
    "model = FastLanguageModel.get_peft_model(\n",
    "    model,\n",
    "    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128\n",
    "    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
    "                      \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
    "    lora_alpha = 16,\n",
    "    lora_dropout = 0, # Supports any, but = 0 is optimized\n",
    "    bias = \"none\",    # Supports any, but = \"none\" is optimized\n",
    "    use_gradient_checkpointing = \"unsloth\", # True or \"unsloth\" for very long context\n",
    "    random_state = 3407,\n",
    "    use_rslora = False,  # We support rank stabilized LoRA\n",
    "    loftq_config = None, # And LoftQ\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vITh0KVJ10qX"
   },
   "source": [
    "<a name=\"Data\"></a>\n",
    "### Data Prep\n",
    "We now use the `ChatML` format for conversation style finetunes. We use [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) in ShareGPT style. ChatML renders multi turn conversations like below:\n",
    "\n",
    "```\n",
    "<|im_start|>system\n",
    "You are a helpful assistant.<|im_end|>\n",
    "<|im_start|>user\n",
    "What's the capital of France?<|im_end|>\n",
    "<|im_start|>assistant\n",
    "Paris.\n",
    "```\n",
    "\n",
    "**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).\n",
    "\n",
    "We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.\n",
    "\n",
    "Normally one has to train `<|im_start|>` and `<|im_end|>`. We instead map `<|im_end|>` to be the EOS token, and leave `<|im_start|>` as is. This requires no additional training of additional tokens.\n",
    "\n",
    "Note ShareGPT uses `{\"from\": \"human\", \"value\" : \"Hi\"}` and not `{\"role\": \"user\", \"content\" : \"Hi\"}`, so we use `mapping` to map it.\n",
    "\n",
    "For text completions like novel writing, try this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LjY75GoYUCB8"
   },
   "outputs": [],
   "source": [
    "from unsloth.chat_templates import get_chat_template\n",
    "\n",
    "tokenizer = get_chat_template(\n",
    "    tokenizer,\n",
    "    chat_template = \"chatml\", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth\n",
    "    mapping = {\"role\" : \"from\", \"content\" : \"value\", \"user\" : \"human\", \"assistant\" : \"gpt\"}, # ShareGPT style\n",
    "    map_eos_token = True, # Maps <|im_end|> to </s> instead\n",
    ")\n",
    "\n",
    "def formatting_prompts_func(examples):\n",
    "    convos = examples[\"conversations\"]\n",
    "    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]\n",
    "    return { \"text\" : texts, }\n",
    "\n",
    "from datasets import load_dataset\n",
    "dataset = load_dataset(\"philschmid/guanaco-sharegpt-style\", split = \"train\")\n",
    "dataset = dataset.map(formatting_prompts_func, batched = True,)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cHiVoToneynS"
   },
   "source": [
    "Let's see how the `ChatML` format works by printing the 5th element"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4GSuKSSbpYKq",
    "outputId": "2bc9b5a8-70c0-4924-b456-0138442031bd"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'from': 'human',\n",
       "  'value': 'What is the typical wattage of bulb in a lightbox?'},\n",
       " {'from': 'gpt',\n",
       "  'value': 'The typical wattage of a bulb in a lightbox is 60 watts, although domestic LED bulbs are normally much lower than 60 watts, as they produce the same or greater lumens for less wattage than alternatives. A 60-watt Equivalent LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to get roughly 9 watts.'},\n",
       " {'from': 'human',\n",
       "  'value': 'Rewrite your description of the typical wattage of a bulb in a lightbox to only include the key points in a list format.'}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset[5][\"conversations\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "U5iEWrUkevpE",
    "outputId": "4f6de3db-4c48-41b1-e505-f4efb1674a38"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bos><|im_start|>user\n",
      "What is the typical wattage of bulb in a lightbox?<|im_end|>\n",
      "<|im_start|>assistant\n",
      "The typical wattage of a bulb in a lightbox is 60 watts, although domestic LED bulbs are normally much lower than 60 watts, as they produce the same or greater lumens for less wattage than alternatives. A 60-watt Equivalent LED bulb can be calculated using the 7:1 ratio, which divides 60 watts by 7 to get roughly 9 watts.<|im_end|>\n",
      "<|im_start|>user\n",
      "Rewrite your description of the typical wattage of a bulb in a lightbox to only include the key points in a list format.<|im_end|>\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(dataset[5][\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GuKOAUDpUeDL"
   },
   "source": [
    "If you're looking to make your own chat template, that also is possible! You must use the Jinja templating regime. We provide our own stripped down version of the `Unsloth template` which we find to be more efficient, and leverages ChatML, Zephyr and Alpaca styles.\n",
    "\n",
    "More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "p31Z-S6FUieB"
   },
   "outputs": [],
   "source": [
    "unsloth_template = \\\n",
    "    \"{{ bos_token }}\"\\\n",
    "    \"{{ 'You are a helpful assistant to the user\\n' }}\"\\\n",
    "    \"{% endif %}\"\\\n",
    "    \"{% for message in messages %}\"\\\n",
    "        \"{% if message['role'] == 'user' %}\"\\\n",
    "            \"{{ '>>> User: ' + message['content'] + '\\n' }}\"\\\n",
    "        \"{% elif message['role'] == 'assistant' %}\"\\\n",
    "            \"{{ '>>> Assistant: ' + message['content'] + eos_token + '\\n' }}\"\\\n",
    "        \"{% endif %}\"\\\n",
    "    \"{% endfor %}\"\\\n",
    "    \"{% if add_generation_prompt %}\"\\\n",
    "        \"{{ '>>> Assistant: ' }}\"\\\n",
    "    \"{% endif %}\"\n",
    "unsloth_eos_token = \"eos_token\"\n",
    "\n",
    "if False:\n",
    "    tokenizer = get_chat_template(\n",
    "        tokenizer,\n",
    "        chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token\n",
    "        mapping = {\"role\" : \"from\", \"content\" : \"value\", \"user\" : \"human\", \"assistant\" : \"gpt\"}, # ShareGPT style\n",
    "        map_eos_token = True, # Maps <|im_end|> to </s> instead\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "idAEIeSQ3xdS"
   },
   "source": [
    "<a name=\"Train\"></a>\n",
    "### Train the model\n",
    "Now let's train our model. We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "95_Nn-89DhsL"
   },
   "outputs": [],
   "source": [
    "from trl import SFTTrainer, SFTConfig\n",
    "\n",
    "trainer = SFTTrainer(\n",
    "    model = model,\n",
    "    tokenizer = tokenizer,\n",
    "    train_dataset = dataset,\n",
    "    packing = False,  # Can make training 5x faster for short sequences.\n",
    "    args = SFTConfig(\n",
    "        per_device_train_batch_size = 1,\n",
    "        gradient_accumulation_steps = 4,\n",
    "        warmup_steps = 5,\n",
    "        max_steps = 20,\n",
    "        learning_rate = 2e-4,\n",
    "        optim = \"adamw_8bit\",\n",
    "        weight_decay = 0.001,\n",
    "        lr_scheduler_type = \"linear\",\n",
    "        seed = 3407,\n",
    "        dataset_text_field = \"text\",\n",
    "        report_to = \"none\",  # Use TrackIO/WandB etc\n",
    "        max_grad_norm = 0.3,\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2ejIt2xSNKKp",
    "outputId": "27e0dbb5-f81e-445b-8d61-75f1c2b2b73a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPU = Tesla T4. Max memory = 14.748 GB.\n",
      "5.967 GB of memory reserved.\n"
     ]
    }
   ],
   "source": [
    "# @title Show current memory stats\n",
    "gpu_stats = torch.cuda.get_device_properties(0)\n",
    "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
    "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
    "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
    "print(f\"{start_gpu_memory} GB of memory reserved.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 791
    },
    "id": "yqxqAZ7KJ4oL",
    "outputId": "d92834e7-9463-4040-94e6-672243709216"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1\n",
      "   \\\\   /|    Num examples = 9,033 | Num Epochs = 1\n",
      "O^O/ \\_/ \\    Batch size per device = 1 | Gradient Accumulation steps = 4\n",
      "\\        /    Total batch size = 4 | Total steps = 20\n",
      " \"-____-\"     Number of trainable parameters = 50,003,968\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='20' max='20' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [20/20 02:33, Epoch 0/1]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Step</th>\n",
       "      <th>Training Loss</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2.941400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>2.585900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1.614300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2.335900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>5</td>\n",
       "      <td>2.123000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>6</td>\n",
       "      <td>1.364300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>7</td>\n",
       "      <td>2.392600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>8</td>\n",
       "      <td>1.775400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>9</td>\n",
       "      <td>2.125000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>10</td>\n",
       "      <td>1.273400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>11</td>\n",
       "      <td>1.502900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>12</td>\n",
       "      <td>1.802700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>13</td>\n",
       "      <td>1.099600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>14</td>\n",
       "      <td>1.497100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>15</td>\n",
       "      <td>2.371100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>16</td>\n",
       "      <td>1.617200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>17</td>\n",
       "      <td>2.402300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>18</td>\n",
       "      <td>1.746100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>19</td>\n",
       "      <td>1.621100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>20</td>\n",
       "      <td>1.919900</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "trainer_stats = trainer.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "pCqnaKmlO1U9",
    "outputId": "90cc348e-4418-4ecd-faf6-cd37595a86d5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "164.0222 seconds used for training.\n",
      "2.73 minutes used for training.\n",
      "Peak reserved memory = 12.316 GB.\n",
      "Peak reserved memory for training = 6.349 GB.\n",
      "Peak reserved memory % of max memory = 83.51 %.\n",
      "Peak reserved memory for training % of max memory = 43.05 %.\n"
     ]
    }
   ],
   "source": [
    "# @title Show final memory and time stats\n",
    "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
    "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
    "used_percentage = round(used_memory / max_memory * 100, 3)\n",
    "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
    "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
    "print(\n",
    "    f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\"\n",
    ")\n",
    "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
    "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
    "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
    "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ekOmTR1hSNcr"
   },
   "source": [
    "<a name=\"Inference\"></a>\n",
    "### Inference\n",
    "Let's run the model! Since we're using `ChatML`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kR3gIAX-SM2q",
    "outputId": "8e932364-849a-40ad-e0b5-5f5ffd028f9c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<|im_start|> is already a token. Skipping.\n",
      "<|im_end|> is already a token. Skipping.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['<bos><|im_start|>user\\nContinue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\\n<|im_start|>assistant\\nThe next number in the Fibonacci sequence is 13.<|im_end|>']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unsloth.chat_templates import get_chat_template\n",
    "\n",
    "tokenizer = get_chat_template(\n",
    "    tokenizer,\n",
    "    chat_template = \"chatml\", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth\n",
    "    mapping = {\"role\" : \"from\", \"content\" : \"value\", \"user\" : \"human\", \"assistant\" : \"gpt\"}, # ShareGPT style\n",
    "    map_eos_token = True, # Maps <|im_end|> to </s> instead\n",
    ")\n",
    "\n",
    "FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
    "\n",
    "messages = [\n",
    "    {\"from\": \"human\", \"value\": \"Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,\"},\n",
    "]\n",
    "inputs = tokenizer.apply_chat_template(\n",
    "    messages,\n",
    "    tokenize = True,\n",
    "    add_generation_prompt = True, # Must add for generation\n",
    "    return_tensors = \"pt\",\n",
    ").to(\"cuda\")\n",
    "\n",
    "outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)\n",
    "tokenizer.batch_decode(outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CrSvZObor0lY"
   },
   "source": [
    " You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "e2pEuRb1r2Vg",
    "outputId": "6b9ace79-d6f3-41aa-9e04-879506f14f4b"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bos><|im_start|>user\n",
      "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n",
      "<|im_start|>assistant\n",
      "The next number in the Fibonacci sequence is 13.<|im_end|>\n"
     ]
    }
   ],
   "source": [
    "FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
    "\n",
    "messages = [\n",
    "    {\"from\": \"human\", \"value\": \"Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,\"},\n",
    "]\n",
    "inputs = tokenizer.apply_chat_template(\n",
    "    messages,\n",
    "    tokenize = True,\n",
    "    add_generation_prompt = True, # Must add for generation\n",
    "    return_tensors = \"pt\",\n",
    ").to(\"cuda\")\n",
    "\n",
    "from transformers import TextStreamer\n",
    "text_streamer = TextStreamer(tokenizer)\n",
    "_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uMuVrWbjAzhc"
   },
   "source": [
    "<a name=\"Save\"></a>\n",
    "### Saving, loading finetuned models\n",
    "To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.\n",
    "\n",
    "**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "upcOlWe7A1vc"
   },
   "outputs": [],
   "source": [
    "model.save_pretrained(\"lora_model\")  # Local saving\n",
    "# model.push_to_hub(\"your_name/lora_model\", token = \"...\") # Online saving"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AEEcJ4qfC7Lp"
   },
   "source": [
    "Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "MKX_XKs_BNZR",
    "outputId": "e8cff9c9-3e85-46ea-9d6a-735935fc88a3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bos><|im_start|>user\n",
      "What is a famous tall tower in Paris?<|im_end|>\n",
      "<|im_start|>assistant\n",
      "The Eiffel Tower is a famous tall tower in Paris. It was built in 1889 to commemorate the 100th anniversary of the French Revolution. It is 324 meters tall and is one of the most recognizable landmarks in the world.<|im_end|>\n"
     ]
    }
   ],
   "source": [
    "if False:\n",
    "    from unsloth import FastLanguageModel\n",
    "    model, tokenizer = FastLanguageModel.from_pretrained(\n",
    "        model_name = \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n",
    "        max_seq_length = max_seq_length,\n",
    "        dtype = dtype,\n",
    "        load_in_4bit = load_in_4bit,\n",
    "    )\n",
    "    FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
    "\n",
    "messages = [\n",
    "    {\"from\": \"human\", \"value\": \"What is a famous tall tower in Paris?\"},\n",
    "]\n",
    "inputs = tokenizer.apply_chat_template(\n",
    "    messages,\n",
    "    tokenize = True,\n",
    "    add_generation_prompt = True, # Must add for generation\n",
    "    return_tensors = \"pt\",\n",
    ").to(\"cuda\")\n",
    "\n",
    "from transformers import TextStreamer\n",
    "text_streamer = TextStreamer(tokenizer)\n",
    "_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QQMjaNrjsU5_"
   },
   "source": [
    "You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yFfaXG0WsQuE"
   },
   "outputs": [],
   "source": [
    "if False:\n",
    "    # I highly do NOT suggest - use Unsloth if possible\n",
    "    from peft import AutoModelForPeftCausalLM\n",
    "    from transformers import AutoTokenizer\n",
    "\n",
    "    model = AutoModelForPeftCausalLM.from_pretrained(\n",
    "        \"lora_model\",  # YOUR MODEL YOU USED FOR TRAINING\n",
    "        load_in_4bit = load_in_4bit,\n",
    "    )\n",
    "    tokenizer = AutoTokenizer.from_pretrained(\"lora_model\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f422JgM9sdVT"
   },
   "source": [
    "### Saving to float16 for VLLM\n",
    "\n",
    "We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iHjt_SMYsd3P"
   },
   "outputs": [],
   "source": [
    "# Merge to 16bit\n",
    "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_16bit\",)\n",
    "if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_16bit\", token = \"\")\n",
    "\n",
    "# Merge to 4bit\n",
    "if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_4bit\",)\n",
    "if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_4bit\", token = \"\")\n",
    "\n",
    "# Just LoRA adapters\n",
    "if False:\n",
    "    model.save_pretrained(\"model\")\n",
    "    tokenizer.save_pretrained(\"model\")\n",
    "if False:\n",
    "    model.push_to_hub(\"hf/model\", token = \"\")\n",
    "    tokenizer.push_to_hub(\"hf/model\", token = \"\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TCv4vXHd61i7"
   },
   "source": [
    "### GGUF / llama.cpp Conversion\n",
    "To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.\n",
    "\n",
    "Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):\n",
    "* `q8_0` - Fast conversion. High resource use, but generally acceptable.\n",
    "* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.\n",
    "* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FqfebeAdT073"
   },
   "outputs": [],
   "source": [
    "# Save to 8bit Q8_0\n",
    "if False: model.save_pretrained_gguf(\"model\", tokenizer,)\n",
    "if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, token = \"\")\n",
    "\n",
    "# Save to 16bit GGUF\n",
    "if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"f16\")\n",
    "if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, quantization_method = \"f16\", token = \"\")\n",
    "\n",
    "# Save to q4_k_m GGUF\n",
    "if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"q4_k_m\")\n",
    "if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, quantization_method = \"q4_k_m\", token = \"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.\n",
    "\n",
    "And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
    "\n",
    "Some other links:\n",
    "1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)\n",
    "2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)\n",
    "3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)\n",
    "6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!\n",
    "\n",
    "<div class=\"align-center\">\n",
    "  <a href=\"https://unsloth.ai\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
    "  <a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n",
    "  <a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a>\n",
    "\n",
    "  Join Discord if you need help + \u2b50\ufe0f <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> \u2b50\ufe0f\n",
    "\n",
    "  This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).\n",
    "</div>\n"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {}
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}